Abstract —Modern crowd theories agree that collective behavior is the result of the underlying interactions among small groups of 
individuals. In this work, we propose a novel algorithm for detecting social groups in crowds by means of a Correlation Clustering 
procedure on people trajectories. The affinity between crowd members is learned through an online formulation of the Structural SVM 
framework and a set of specifically designed features characterizing both their physical and social identity, inspired by Proxemic theory, 
Granger causality, DTW and Heat-maps. To adhere to sociological observations, we introduce a loss function (G-MITRE) able to deal 
with the complexity of evaluating group detection performances. We show our algorithm achieves state-of-the-art results when relying on 
both ground truth trajectories and tracklets previously extracted by available detector/tracker systems. 

Index Terms —Crowd analysis, group detection, Structural SVM, Correlation Clustering, Proxemic theory, Granger causality. 
1 Introduction 

Crowd phenomena are complex and their logic still escapes 
formal rules and precise social explanations. Ultimately, 
the ambition of crowd analysis is to characterize people's 
behaviors, predict and prevent potentially dangerous situations 
and improve the well-being of communities. This has been 
traditionally pursued through simulation models [?] or automatic 
video analysis [?]. Recently, groups have been recognized as 
the basic elements which compose the crowd [?], leading to 
an intermediate level of abstraction that is placed between 
two opposing views: the crowd as a flow of indistinguishable 
people [4] and its interpretation as a collection of individuals [?]. 
Identifying groups is consequently a mandatory step to grasp 
the complex social dynamics ruling collective behaviors in 
crowds. This poses new challenges for computer vision, 
since groups are definitely more difficult to characterize than 
pedestrians acting alone or as a whole. 

In this work, we propose a learning-based solution for 
visually detecting groups in low/medium density crowds (Fig. 1) 
under the hypothesis that the concept of group can be visually 
discerned and people's trajectories can be extracted to some 
extent. The main novelty of our approach is the joint adoption 
of sociologically grounded features and a learning framework 
able to specialize the concept of group to different 
scenarios, motion constraints and crowd densities. To this end, 
we adhere to a classical sociological interpretation of groups [?], 
which can be formalized as follows. 

Definition 1. A group is defined as two or more people 
interacting to reach a common goal and perceiving a shared 
membership, based on both physical and social identity. 

Accordingly, we propose a new formulation of the problem 
of detecting groups in crowds as a supervised Correlation 
Clustering (CC) [7]. We solve it through a Structural Support 
Vector Machine (Structural SVM) [?] framework that learns a 
context-dependent distance measure, based on a set of features 
inspired by Def. 1 and effective on both ground truth trajectories 
and automatically obtained tracklets. The design of socially 
grounded features is one of the main contributions of the work. 

Fig. 1. Examples of social groups detected in crowds. 

Moreover, a new socially based loss function (G-MITRE) 
is defined for the Structural SVM. Unlike previous 
solutions [?] and [?], our approach does not rely on 
scene-dependent parameters that would limit the applicability of the 
method in real-world contexts. Finally, we also propose an 
online learning procedure that handles smooth variations in 
crowd composition and density, useful in online surveillance. 

We annotated and made publicly available two new datasets: 
MPT-20x100 and GVEII (see Sec. 7). Results on standard 
benchmarks, as well as on the proposed datasets, outperform 
current methods. We believe that an automatic system 
for group detection will influence future visual surveillance 
of public areas and will benefit modeling and simulation 
applications for architectural planning by providing real and 
precise observational data of crowd phenomena. 

2 Related Work 

The modeling of pedestrian dynamics in crowds is a 
relatively recent research field. Most works are based on 
sociological paradigms, and computer vision approaches 
have also evolved under the influence of these theories. 

Modeling and Observing the Crowd 

Most research has tried to tackle the crowd as 
an exclusively collective phenomenon, where individuality 
does not exist. This recalls the primitive Popular Mind 
Theory [10] by Gustave Le Bon, where the crowd was defined 
as a “pathological monster with no individual consciousness”. 
Accordingly, crowds have been analyzed by means of physical 
models (e.g. hydrodynamics), neglecting the existence of 
single individual purposes and goals. However, these models 
are effective mainly in extremely dense crowds. Conversely, 
many other approaches have been inspired by the 1970s Social 
Loafing Theory [?], which stated that individuality was a 
strong requirement for the pursuit of personal goals. Helbing's 
Social Force Model [5], which asserts that each pedestrian's movement 
towards her goals is influenced by the surrounding pedestrians, 
has been the main building block for many crowd modeling and 
analysis works, ranging from abnormal behavior detection [?] 
to tracking [?]. Recently, studies on people attending events 
have underlined that most people tend to move in 
groups and that social relations influence the way people behave in 
crowds [?], [?]. These empirical observations are supported by 
Reicher in the recent Social Identity Model of Deindividuation 
Effects [?], which assumes that crowd behavior is regulated 
by the social rules and behaviors groups choose to adopt. This 
is the main social paradigm underpinning our research too. 

Visual Detection of Groups in Crowds 
It was only recently that group detection showed promising 
results. The process is in fact built upon several open challenges 
in computer vision, starting from people detection and 
tracking in crowds [16] to analyzing and grouping extracted 
trajectories [?]. 

Some works employ the concept of F-formations by 
Kendon [18] to discern the group formation process. Broadly 
speaking, F-formations can be seen as specific positional and 
orientational patterns that people must sustain in order to be 
considered engaged in a social relationship. Despite robust 
results [19], this theory is suited to stationary groups only 
and is not defined for moving groups, a case which cannot be 
ignored in crowd analysis. 

Thus, complementary approaches analyze pedestrians' motion 
paths; according to the type of available tracklets, they 
can be partitioned into group-based, individual-group joint 
and individual-based. In group-based approaches, groups are 
considered as atomic entities in the scene since no higher 
level information can be extracted neatly, typically due to high 
noise or high complexity of crowded scenes [20], [?]. Since 
these models are often too simplistic to further infer on groups' 
behavior, individual-group joint approaches try to overcome the 
lack of finer information by hypothesizing trajectories while 
tracking groups at a coarser level [22], [23]. Finally, individual-based 
tracking algorithms build upon single pedestrians' 
trajectories. This kind of approach has been gaining momentum 
only recently, since tracking even in high density crowds is 
becoming a more feasible task every day. Pellegrini et 
al. [?] employ a Conditional Random Field to jointly predict 
trajectories and estimate group memberships, modeled as 
latent variables, over a short time window. Yamaguchi et 
al. [24] predict whether two pedestrians are in the same group 
through a linear SVM on trivial distance, speed difference 
and time overlap information. Recently, Chang et al. [25] 
proposed a soft segmentation process to partition the crowd 
by constructing a weighted graph, where the edges represent 
the probability of individuals to belong to the same group. An 
interesting unsupervised approach is Zanotto et al. [26], where 
a potentially infinite mixture model is fitted on pedestrians, 
regarded as sampled observations from the mixture. Previous 
frames' data and predictions are used as prior information for 
the models (one for each group), but pairwise relations between 
individuals are neglected as groups are modeled only through 
the mean position and velocity of their members. Above all, we 
mention Ge et al. [2], who suggest the use of an agglomerative 
approach to cluster trajectories, as we do. They hierarchically 
merge clusters by evaluating a well-founded sociological inter-group 
closeness measure defined on a combination of proximity 
and velocity features, stopping when a given condition is met. 

Conversely, our method relies neither on fixed relative 
position or velocity thresholds [?], [26] nor on sequence-dependent 
parameters [?]; it is flexible and general as the 
features are not scene-specific [25] and their contribution is 
learned from examples. Thanks to the use of a clustering 
inference rule, solutions proposed by our method are partitions 
and not coverings of the members of the crowd [24], meaning 
that pairwise relations are consistent with the overall group 
structure found. Moreover, the use of a time window to 
predict groups lets the method recognize that non-trivial 
behaviors (e.g. neglecting strict proximity) may occur, 
whereas frame-by-frame methods are limited to short-term 
reasoning [26]. Yet, the discriminative nature of the employed 
framework makes learning compelling in terms of both 
required data and computational cost, as opposed to graphical 
models optimizing over a multiple hypothesis space [?]. 

This work extends our preliminary attempt in [17]. Here we 
prove that our proposal complies with social theories of group 
formation, we devise and investigate new features to better 
adhere to the sociological theory underpinning our method 
and, finally, we extend the tests to new, remarkably complex 
datasets and compare with more recent competing algorithms. 
Besides, the experiments further probe the need for learning 
when dealing with heterogeneous crowds, shedding light on 
the nature of the problem itself. 

3 Problem Definition 

We cast the group detection task as a clustering problem. 
Consider a set of pedestrians $M = \{a, b, \ldots\}$ and let $\mathcal{Y}(M)$ be 
the set of all possible ways to partition $M$. Defining $y$ as a 
subset of pedestrians (also referred to as group or cluster) in 
$M$, a generic set of subsets $\mathbf{y} = \{y_1, y_2, \ldots\}$ is a valid 
solution in $\mathcal{Y}(M)$ if the partitioning axioms are satisfied: 
$\forall a \in M,\ \exists!\, y \in \mathbf{y} : a \in y$ and $\bigcup_{y \in \mathbf{y}} y = M$. Here, 
we call singletons those pedestrians whose cluster is composed 
of themselves only, i.e. $|y| = 1$. 
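
As an illustration only, the partitioning axioms can be checked mechanically. The following Python sketch is not part of the original method; the function names `is_valid_partition` and `singletons` are hypothetical, and pedestrians are represented by arbitrary hashable labels.

```python
from typing import Hashable, Iterable, List, Set


def is_valid_partition(members: Set[Hashable], clusters: Iterable[Set[Hashable]]) -> bool:
    """True if every member lies in exactly one non-empty cluster and the clusters cover `members`."""
    cluster_list: List[Set[Hashable]] = [set(c) for c in clusters]
    if any(len(c) == 0 for c in cluster_list):
        return False
    covered = set().union(*cluster_list) if cluster_list else set()
    total = sum(len(c) for c in cluster_list)
    # Coverage: the union equals M. Uniqueness: sizes add up to |M|, so nobody appears twice.
    return covered == set(members) and total == len(members)


def singletons(clusters: Iterable[Set[Hashable]]) -> List[Set[Hashable]]:
    """Clusters made of a single pedestrian, i.e. |y| = 1."""
    return [set(c) for c in clusters if len(c) == 1]


M = {"a", "b", "c", "d"}
print(is_valid_partition(M, [{"a", "b"}, {"c"}, {"d"}]))       # valid partition
print(is_valid_partition(M, [{"a", "b"}, {"b", "c"}, {"d"}]))  # a covering, not a partition
```

A covering in which a pedestrian belongs to two groups fails the uniqueness check, which is the distinction between partitions and coverings drawn in Sec. 2.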

In crowded contexts, this grouping cannot be solved by 
exploiting spatial (positional or orientational) information only, 
as proposed in F-formation theory, due both to confusion and 
motion. Moreover, it is often the case that the physical distance 
between a singleton and a member of a cluster is lower than 
that cluster's mean intra-member distance. This is due to the fact 
that, in real situations, social aspects heavily intervene in the 
group formation process. In order to obtain crowd partitions 
that are meaningful from a sociological point of view, the 
following relevant properties of social groups must hold. 

Fig. 2. Highlights of social groups properties: (a) hierarchical coherence, 
(b) density invariance and (c) transitivity. 

Hierarchical Coherence. Groups are composed of individuals 
and sub-groups in a recursive fashion (Fig. 2a). This was 
first observed in the seminal work of Canetti [27], based 
on the assumption that members within a group cannot erase 
already settled relationships as the crowd assembles. 

Density Invariance. To keep their group identities preserved 
at different crowd densities, members must be willing to 
change the inner distance among them. Groups in very crowded 
scenes will be more closed and compact, while groups in open 
spaces will tend to exhibit more dilated patterns (Fig. 2b); 
sociological and empirical evidence can be found in Bandini 
et al. [?] and in Moussaid et al. [?]. 

Transitivity. Not every member of a group needs to be 
strictly connected with everyone else, but any two members 
may be part of the same group by means of a sufficiently 
dense subgroup of pedestrians standing between them (Fig. 2c). 
McPhail and Wohlstein’s work [28] formalized this idea: to 
be considered part of a group one typically has to be 
connected with at least half of the members. 


4 Socially Constrained Clustering for Groups Detection 


We propose to solve the crowd partitioning problem by employing 
Correlation Clustering (CC) [7], and we prove it is possible 
to achieve a quasi-optimal crowd partition guaranteed to satisfy 
the three aforementioned properties of Sec. 3. The CC algorithm 
takes as input an affinity matrix $W$ where, if $W_{ab} > 0$ ($W_{ab} < 0$), 
elements $a$ and $b$ belong to the same (different) cluster with 
certainty $|W_{ab}|$. The algorithm returns the partition $\mathbf{y}$ of a set 
of elements $M = \{a, b, \ldots\}$ so that the sum of the affinities 
between item pairs in the same clusters $y$ is maximized: 


$$ CC = \arg\max_{\mathbf{y} \in \mathcal{Y}(M)} \sum_{y \in \mathbf{y}} \sum_{a \neq b \in y} W_{ab} \tag{1} $$


The pairwise elements affinity in $W$ is parameterized as a 
weighted linear combination of a bounded dissimilarity measure 
and its complement: 

$$ W^d_{ab} = \alpha^T (1 - d(a,b)) - \beta^T d(a,b). \tag{2} $$


To be consistent with the definition of groups of Sec. 1, we 
devise the pairwise distance $d(a, b)$ between pedestrians $a$ and $b$ 
as detailed in Sec. 5. 
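
To make the parametrization concrete, the following minimal Python sketch evaluates the affinity of Eq. (2) and the CC objective of Eq. (1) for a given partition. It is an illustration under stated assumptions, not the original implementation: the names `affinity` and `cc_score`, the dictionary layout of the distances and the toy values are all invented for the example.

```python
from itertools import combinations


def affinity(d, alpha, beta):
    """Eq. (2) for one pair: alpha^T (1 - d) - beta^T d, with d in [0, 1]^p."""
    return sum(a * (1.0 - x) - b * x for a, b, x in zip(alpha, beta, d))


def cc_score(partition, dist, alpha, beta):
    """Sum of affinities over unordered pairs sharing a cluster.

    Eq. (1) counts ordered pairs, i.e. every pair twice; the constant
    factor 2 does not change which partition attains the maximum.
    """
    return sum(
        affinity(dist[(a, b)], alpha, beta)
        for cluster in partition
        for a, b in combinations(sorted(cluster), 2)
    )


# Toy data (invented): one feature per pair, three pedestrians.
dist = {("a", "b"): [0.1], ("a", "c"): [0.9], ("b", "c"): [0.8]}
alpha, beta = [1.0], [1.0]

for partition in ([{"a", "b"}, {"c"}], [{"a", "b", "c"}], [{"a"}, {"b"}, {"c"}]):
    print(partition, round(cc_score(partition, dist, alpha, beta), 3))
```

With these toy values the partition that keeps a and b together and c alone scores highest, because only the pair (a, b) has positive affinity.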

In clustering theory, changing the dissimilarity space results 
in a different partitioning of the domain through the same 
algorithm. By tuning the $[\alpha, \beta]$ parameters in Eq. (2) we can 
evaluate many different groupings, and we will show that, under 
a restricted set of hypotheses, they all satisfy the social properties 
previously mentioned. In order to efficiently learn those 
parameters according to the different peculiarities groups exhibit in 
different scenarios, in Sec. 6 we introduce Structural SVM [29] 
with both an approximated inference procedure and a loss 
function specifically designed for accurately measuring the 
compatibility among possible crowd partitions. 

The solution to Eq. (1), given the parametrization introduced 
in Eq. (2) and subject to a hierarchical inference procedure, 
guarantees the satisfaction of all the social groups properties: 

Theorem 1. When the pairwise elements affinity in W is a 
weighted linear combination of a bounded similarity measure 
and its complement, a bottom-up approximated solution to CC 
produces a partition that respects the hierarchical coherence, 
density invariance and transitivity properties of social groups. 

Proof. Let $d : M \times M \to [0, 1]^p$ be a bounded distance on the 
set of members of a crowd $M$ so that $(M, d)$ is a dissimilarity 
space and suppose the affinity matrix of CC is constructed as in 
Eq. (2), for some appropriate positive values of $\alpha, \beta \in \mathbb{R}^p$. To 
demonstrate that density invariance holds for all solutions 
of CC, consider that when the density increases, both distances 
between groups and between members of the same group 
diminish. This phenomenon is a less formal statement of the 
scale invariance axiom of clustering defined in Kleinberg [30], 
which is known to hold for sum-of-pairs clustering algorithms. 
We must thus show that it also holds when we are maximizing 
affinities instead of minimizing distances. To this aim, 
let $\tilde{d} = \lambda d$ with $\tilde{d} : M \times M \to [0, \lambda]^p$, whose complement is taken 
with respect to the bound $\lambda$, so that 

$$ W^{\tilde{d}} = \alpha^T (\lambda - \lambda d) - \beta^T \lambda d = \lambda \left[ \alpha^T (1 - d) - \beta^T d \right] = \lambda W^d, \tag{3} $$

where the notation for the elements is dropped for clarity. 
Consequently, CC satisfies the scale invariance axiom since 
multiplying all distances by a constant results in multiplying 
the total affinity of each cluster by a constant and hence 
the maximum affinity clustering solution is not changed. 
Transitivity follows directly from the objective function of 
CC in Eq. (1): for two pedestrians to be assigned to the same group, it suffices 
that there exist any number of members such that the net effect 
of all the involved pairwise relations is non-decreasing. Last, 
hierarchical coherence requires a greedy approximation 
algorithm to optimize the CC that initially considers each 
pedestrian in its own cluster and then iteratively merges the two 
clusters whose union would produce the best clustering score, 
stopping when joining clusters would decrease the overall 
affinity. Hence, elements in the same cluster at lower levels of 
the hierarchy are also together in higher level clusters. □ 
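
The bottom-up procedure used in the proof can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the experiments: the function name `greedy_cc`, the representation of $W$ as a dictionary keyed by `frozenset` pairs and the affinity values are assumptions, and the sketch stops as soon as no merge yields a strictly positive gain.

```python
from itertools import combinations


def greedy_cc(items, W):
    """Bottom-up approximation of correlation clustering.

    W maps frozenset({a, b}) to the pairwise affinity. Start from singletons
    and repeatedly merge the two clusters whose union increases the total
    affinity the most; stop when no merge has a positive gain.
    """
    clusters = [frozenset([x]) for x in items]
    while len(clusters) > 1:
        best_gain, best_pair = 0.0, None
        for c1, c2 in combinations(clusters, 2):
            # Gain of a merge = sum of the affinities across the two clusters.
            gain = sum(W[frozenset((a, b))] for a in c1 for b in c2)
            if gain > best_gain:
                best_gain, best_pair = gain, (c1, c2)
        if best_pair is None:
            break
        c1, c2 = best_pair
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 | c2]
    return set(clusters)


# Invented affinities for four pedestrians.
W = {frozenset(p): v for p, v in {
    ("a", "b"): 0.8, ("c", "d"): 0.5, ("a", "c"): -0.6,
    ("a", "d"): -0.7, ("b", "c"): -0.4, ("b", "d"): -0.9}.items()}
print(greedy_cc("abcd", W))

# Scaling every affinity by a positive constant leaves the result unchanged.
print(greedy_cc("abcd", {k: 0.5 * v for k, v in W.items()}))

# Transitivity: a and c end up together through b despite a negative affinity.
W3 = {frozenset("ab"): 0.9, frozenset("bc"): 0.9, frozenset("ac"): -0.5}
print(greedy_cc("abc", W3))
```

Since clusters are only ever merged, pedestrians grouped at one level of the hierarchy stay together at all higher levels, which is the hierarchical coherence argument of the proof.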
(a) Physical distance (b) Motion causality (c) Trajectory shape (d) Paths convergence 

Fig. 3. Features: physical identity (a) and social identity (b,c) provide a computational interpretation of the concept of group membership, while (d) 
evaluates the likelihood of the existence of a shared goal between pedestrians. 

Fig. 4. Proxemics, modeled by Gaussians (b), reveal physical identity 
through physical distance (a). 


5 Social Features for Social Groups 

Given the problem formulation in Sec. 3 and the CC 
parametrization of Eq. (2), here we define the distance function 
$d$ which acts on pairs of trajectories. We consider the pedestrian 
trajectory $T_a = \{(t, \mathbf{p}_a^t)\}_t$, projected onto the ground plane, 
as a multivariate time series of metric (in meters) spatial 
observations $\mathbf{p}_a^t$ for pedestrian $a$ at different times $t$. In order to 
deal with the continuously changing nature of groups (splitting, 
merging, switching members, ...) we reduce the observation 
period to a time window $\mathcal{T}$ of fixed length. As a consequence, 
groups can be detected differently even between (potentially 
overlapping) sequential time windows $\mathcal{T}_k$ and $\mathcal{T}_{k+1}$. 

According to Def. 1, we devise four features able to 
capture both the pedestrians' physical and social identity as 
well as to discern the presence of a shared goal among them, 
namely: physical identity $d_{ph}$, trajectory shape similarity $d_{sh}$, 
pedestrian causality $d_{ca}$ and heat maps $d_{hm}$. A pairwise feature 
vector $d^k(a, b)$ is hence defined for every pair of trajectories 
$T_a$ and $T_b$, and for every time window $\mathcal{T}_k$, as 

$$ d(a, b) \triangleq d^k(a, b) = [d_{ph}, d_{sh}, d_{ca}, d_{hm}]_{ab}. \tag{4} $$

5.1 From Physical Distances to Physical Identity 

The physical identity can be regarded as a static relation 
connecting physical distance to group membership. In his 
Proxemic Theory, Hall [31] focused on the physical interactions 
between pairs of individuals. More precisely, the theory is about 
“the study of ways in which man gains knowledge of the content 
of other men’s minds through judgments of behaviour patterns 
associated with varying degrees of proximity to them.” 

The proxemic model formalizes how people use physical 
space in interpersonal interactions and defines a set of concentric 
bubbles around every individual, as depicted in Fig. 3a. 


TABLE 1 
Proxemics characterization as found in Hall’s theory. 

space      boundaries (m)   description 
intimate   0.0 - 0.5        unmistakable involvement 
personal   0.5 - 1.2        familiar interactions 
social     1.2 - 3.7        formal relationships 
public     3.7 - 7.6        non-personal interactions 

Nevertheless, the transition between the four different proxemic 
zones is abrupt (Tab. 1). 

Spatial quantization can be heavily affected by noise or errors, 
leading to wrong classification. Several approaches assign a 
score to proxemic classes in order to obtain a continuous real-valued 
similarity measure [1], [32], [33]. To grasp the distance-based 
characteristics of group formation, we relax 
Hall’s original quantization by employing a Gaussian Mixture Model 
(GMM) on the ground plane, centered on the person's location 
and with fixed proxemics-inspired covariance matrices. The 
resulting GMM is a weighted sum of zero-mean Gaussians 
with diagonal covariance matrices reflecting Hall’s boundaries 
(i.e. the diagonal entries of $\Sigma_1$ are 0.5, those of $\Sigma_2$ are 1.2, ...): 

$$ GMM(\mathbf{p}_a^t - \mathbf{p}_b^t) = \frac{1}{4} \sum_{i=1}^{4} \mathcal{N}(\mathbf{p}_a^t - \mathbf{p}_b^t \mid 0, \Sigma_i) \tag{5} $$

Given a pair of trajectories $T_a$ and $T_b$, we evaluate the mixture 
model of Eq. (5) on the vector of distances at each time instant. 
This is equivalent to placing the mixture on $\mathbf{p}_a^t$ and measuring 
where the point $\mathbf{p}_b^t$ lies inside the proxemic space at each 
instant $t$, as shown in Fig. 4 and in Fig. ??. 

The static measure of social cohesion, called $d_{ph}$, is then 
defined by averaging the mixture model responses over 
the set of time instants where trajectories $T_a$ and $T_b$ are 
simultaneously present in the current time window, $\mathcal{T} \subseteq \mathcal{T}_k$: 

$$ d_{ph}(a, b) = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} GMM(\mathbf{p}_a^t - \mathbf{p}_b^t) \tag{6} $$

Averaging is required since the physical identity among group 
members is established in time and must remain coherent in 
order to be a valid measure of social cohesion. 
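
A possible reading of Eqs. (5) and (6) is sketched below in Python. Several choices are assumptions made for the example and are not confirmed by the text: the covariances are taken as isotropic with diagonal entries equal to Hall's boundaries of Table 1, trajectories are stored as dictionaries mapping a time index to a ground-plane position in metres, and pairs that never co-occur are assigned a response of zero. The names `gmm` and `d_ph` are hypothetical.

```python
import math

HALL_BOUNDS = (0.5, 1.2, 3.7, 7.6)  # metres, upper boundaries from Table 1


def gmm(dx, dy, variances=HALL_BOUNDS):
    """Equal-weight mixture of zero-mean isotropic 2-D Gaussians (Eq. 5).

    Assumption: each covariance is v * I, with v one of Hall's boundaries.
    """
    r2 = dx * dx + dy * dy
    density = sum(math.exp(-0.5 * r2 / v) / (2.0 * math.pi * v) for v in variances)
    return density / len(variances)


def d_ph(traj_a, traj_b):
    """Eq. (6): mean mixture response over the instants shared by both trajectories."""
    common = sorted(set(traj_a) & set(traj_b))
    if not common:
        return 0.0  # assumption: no co-occurrence, no evidence of cohesion
    total = 0.0
    for t in common:
        (xa, ya), (xb, yb) = traj_a[t], traj_b[t]
        total += gmm(xa - xb, ya - yb)
    return total / len(common)


a = {0: (0.0, 0.0), 1: (1.0, 0.0)}
b = {0: (0.0, 0.5), 1: (1.0, 0.5)}   # walks half a metre from a
c = {0: (0.0, 6.0), 1: (1.0, 6.0)}   # walks six metres from a
print(d_ph(a, b), d_ph(a, c))
```

The response decreases smoothly with distance, which is the purpose of replacing the hard proxemic zones with a mixture.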
5.2 Motion as an Indicator of Social Identity 

Social Identity [6], [34] is a psychological paradigm built 
on the intuition that group behavior is an emerging dynamic, 
reflecting a shift in the self-conception of the members, who start 
to define themselves in terms of their common membership. 
According to [35], social identity is reflected in the way people 
mutually influence each other and consequently move in groups, 
suggesting that social identity could be observed through 
trajectory shape similarity and temporal causality of paths. 

5.2.1 Temporal Causality 

Under the hypothesis of sufficiently stationary trajectories, 
which is typically true for the observation of a time window, we 
can employ the econometric model of Granger causality [36] to 
measure to what extent pedestrians are mutually affecting their 
motion paths [37]. Accordingly, we formalize two requirements: 

1) the causal pedestrian will move before the effect pedestrian, and 

2) the motion of the causal pedestrian contains information 
about the way the effect pedestrian moves that cannot 
be found in any other pedestrian motion. 

A consequence of these statements is that the causal pedestrian's 
trajectory can help forecast the effect pedestrian's trajectory 
even after other data has first been used. Let us define $m$ as the 
lag value for the causality analysis and denote the optimum 
least-squares predictor of a stationary trajectory $T_a$ at time 
$t$ using the set of values $\bar{T}_a(t - m)$ by $P_t(T_a \mid \bar{T}_a(t - m))$. 
Here $\bar{T}_a(t - m)$ is all the information about trajectory $T_a$ 
accumulated from time $t - m$ (inside the current time window 
$\mathcal{T}_k$) up to time $t - 1$. The predictive error series is denoted 
by $\varepsilon_t(T_a \mid \bar{T}_a(t - m)) = T_a(t) - P_t(T_a \mid \bar{T}_a(t - m))$, and we define 
$\sigma^2(T_a \mid \bar{T}_a(t - m))$ as the variance of $\varepsilon_t(T_a \mid \bar{T}_a(t - m))$. Trajectory 
$T_b$ is said to Granger-cause $T_a$, briefly $b \to a$, if 

$$ \sigma^2(T_a \mid \bar{T}_a(t - m)) > \sigma^2(T_a \mid \bar{T}_a(t - m), \bar{T}_b(t - m)) \tag{7} $$

The feature is then derived from a specific testing procedure 
used to evaluate the trustworthiness of Granger causality. Let us 
introduce the sum of squared residuals for the constrained 
and unconstrained models as 

$$ RSS_c = \sum_{t=1}^{K} \varepsilon_t(T_a \mid \bar{T}_a(t - m))^2 \quad \text{and} \quad RSS_u = \sum_{t=1}^{K} \varepsilon_t(T_a \mid \bar{T}_a(t - m), \bar{T}_b(t - m))^2, \tag{8} $$

where $K$ is the number of samples considered for the analysis. 
We design our feature $d_{ca}$ so as to be the critical confidence 
measure of the hypothesis that Granger causality exists between 
$T_a$ and $T_b$. To this end, we consider the test statistic 

$$ S_{b \to a} = \frac{(RSS_c - RSS_u)/m}{RSS_u/(K - 2m - 1)} \tag{9} $$

and compute the area under the Fisher-Snedecor probability 
density function $\mathcal{F}$ to the left of $S$, as shown in Fig. 5. This results in 
the following integral, which has a closed-form solution [38]: 

$$ d_{ca}(a, b) = \max_{S \in \{S_{a \to b},\, S_{b \to a}\}} \int_0^S \mathcal{F}(x \mid m, K - 2m - 1)\, dx, \tag{10} $$

where $S_{b \to a}$ and $S_{a \to b}$ are both considered in order to obtain 
symmetry but, as we value the existence of causality over its 
direction, we only keep the one which maximizes the probability. 

Fig. 5. Visual example of causality probability. The vertical line is the $S$ of 
Eq. (9), while the shaded area is $d_{ca}$. 
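
The test statistic of Eq. (9) can be illustrated with a short NumPy sketch. It is a simplified reading, not the original code: trajectories are reduced to one-dimensional series, both regressions include an intercept, and the names `_rss` and `granger_statistic` as well as the synthetic data are assumptions. The feature of Eq. (10) would then be obtained by evaluating the F-distribution CDF at the larger of the two statistics, for instance with `scipy.stats.f.cdf(S, m, K - 2 * m - 1)`.

```python
import numpy as np


def _rss(target, regressors):
    """Residual sum of squares of an ordinary least-squares fit with intercept."""
    X = np.column_stack([np.ones(len(target))] + regressors)
    coef, *_ = np.linalg.lstsq(X, target, rcond=None)
    resid = target - X @ coef
    return float(resid @ resid)


def granger_statistic(x, y, m=1):
    """Statistic S_{y -> x} of Eq. (9) for 1-D series x and y with lag m."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    K = len(x) - m                              # number of usable samples
    target = x[m:]
    lags_x = [x[m - k: len(x) - k] for k in range(1, m + 1)]
    lags_y = [y[m - k: len(y) - k] for k in range(1, m + 1)]
    rss_c = _rss(target, lags_x)                # constrained: own past only
    rss_u = _rss(target, lags_x + lags_y)       # unconstrained: own past and the other's past
    return ((rss_c - rss_u) / m) / (rss_u / (K - 2 * m - 1))


# Synthetic example: x follows y with a delay of one step plus small noise.
rng = np.random.default_rng(0)
y = rng.normal(size=200)
x = np.empty(200)
x[0] = 0.0
x[1:] = 0.8 * y[:-1] + 0.1 * rng.normal(size=199)

s_yx = granger_statistic(x, y)   # does the past of y help predict x?
s_xy = granger_statistic(y, x)   # does the past of x help predict y?
print(s_yx, s_xy)
```

In this synthetic case the statistic is large only in the direction from y to x, and taking the maximum over both directions, as in Eq. (10), makes the resulting feature symmetric.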

5.2.2 Shape Similarity 

Shape similarity may also be useful in describing social identity 
as it overcomes the limit of the pointwise and static evaluation 
of proxemics. We use Dynamic Time Warping (DTW) [39] 
on Euclidean coordinates to map one time series to another by 
minimizing the distance between the two. In particular, the flexibility 
of DTW allows two time series that are similar but locally 
out of phase to align in a non-linear manner. Suppose we have 
two trajectories $T_a$ and $T_b$ of lengths $A$ and $B$ respectively. 
To align these two sequences using DTW, we first construct 
a distance matrix $\{D^{ab}_{ij}\} \in \mathbb{R}^{A \times B}$ that encodes the squared 
Euclidean distance between any $i$-th element of $T_a$ and $j$-th 
element of $T_b$ inside the current time window. 

The best alignment can be found by a recursive minimization 
of the cumulative cost $\gamma_{ab}$ of any path through the distance 
matrix originating in $D^{ab}_{11}$: 

$$ \gamma_{ab}(i, j) = D^{ab}_{ij} + \min\{\gamma_{ab}(i-1, j-1),\ \gamma_{ab}(i-1, j),\ \gamma_{ab}(i, j-1)\} \tag{11} $$

In particular, we construct our feature to be the distance of the 
two sequences once they are optimally aligned, that is the sum 
of the Euclidean distances of associated points of $T_a$ and $T_b$: 

$$ d_{sh}(a, b) = \gamma_{ab}(A, B) / \max(A, B) \tag{12} $$

where the denominator is the optimal warping path length used 
as a normalization factor. 
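
The recursion and the normalization above can be sketched in plain Python as follows. The sketch assumes the standard DTW recurrence with the three usual predecessor cells and non-empty trajectories given as lists of ground-plane points; the function name `dtw_distance` is hypothetical.

```python
def dtw_distance(traj_a, traj_b):
    """Normalised DTW cost between two non-empty lists of (x, y) points."""
    A, B = len(traj_a), len(traj_b)
    INF = float("inf")
    # gamma[i][j] = minimal cumulative cost of aligning the first i and j points.
    gamma = [[INF] * (B + 1) for _ in range(A + 1)]
    gamma[0][0] = 0.0
    for i in range(1, A + 1):
        for j in range(1, B + 1):
            (xa, ya), (xb, yb) = traj_a[i - 1], traj_b[j - 1]
            cost = (xa - xb) ** 2 + (ya - yb) ** 2   # squared Euclidean distance
            gamma[i][j] = cost + min(gamma[i - 1][j - 1],
                                     gamma[i - 1][j],
                                     gamma[i][j - 1])
    return gamma[A][B] / max(A, B)


path = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
delayed = [(0.0, 0.0)] + path                 # same path, starting one step late
parallel = [(x, 1.0) for x, _ in path]        # same shape, one metre aside
print(dtw_distance(path, delayed), dtw_distance(path, parallel))
```

The delayed copy aligns at zero cost although the two series are out of phase, whereas the parallel path keeps a constant cost per aligned pair.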

5.3 Common Goals from People Motion 

The features described so far focus on both static and dynamic 
aspects of trajectories when groups are already established, but 
neglect the smooth process of group formation. People may 
merge into groups starting from different locations (e.g. a meeting 
action) or groups may split into subgroups and singletons 
(according to the hierarchical coherence property of group 
formation). Meeting or being close for a sufficient amount of 
time may indicate the presence of a shared goal. Following 
the results in [40], where heat maps were used to recognize 
group activities, we also employ a heat-map-inspired feature 
to holistically model groups. 

A heat map $H_a : \mathbb{N}_R \times \mathbb{N}_C \to [0, 1]$ associated to the 
trajectory $T_a$ is an $R$-by-$C$ grid of heat sources $h_a$ that partitions 
the ground plane. The heat source $h_a(i, j)$ activates if the 
trajectory $T_a$ happens to walk in the relative grid cell $(i, j)$ 
and once activated it is subject to thermal decay and thermal 
diffusion processes: 

$$ H_a(i, j) = \sum_{p=1}^{R} \sum_{q=1}^{C} e^{-k_s \|(i, j) - (p, q)\|} E_a(p, q) \tag{13} $$

where $k_s$ is a parameter suggesting the relative importance 
of different patches at different distances and $E_a(p, q)$ is the 
thermal energy produced by $T_a$ on the patch $(p, q)$. If we let 
$\bar{E}_a(p, q)$ be the accumulated thermal energy, we have 

$$ E_a(p, q) = \bar{E}_a(p, q) \cdot e^{-k_r \tau} \tag{14} $$

where $k_r$ is a parameter regulating the slowdown of the heat 
accumulation and dispersion and $\tau$ is the duration of the 
interaction between pedestrian $a$ and cell $(p, q)$ inside the 
current time window $\mathcal{T}_k$. 

Once we have constructed heat maps for every trajectory, we 
define a similarity metric between two trajectories $T_a$ and $T_b$ 
as the volume under the combined heat surface $H_{ab}$ obtained 
as the pointwise product of the two heat maps $H_a$ and $H_b$: 

$$ d_{hm}(a, b) = \sum_{i=1}^{R} \sum_{j=1}^{C} H_{ab}(i, j) = \sum_{i=1}^{R} \sum_{j=1}^{C} H_a(i, j)\, H_b(i, j) \tag{15} $$

The volume under $H_{ab}$ reveals to what extent $T_a$ and $T_b$ have 
been close in space during the observation period, something 
that proxemics could already measure. Nevertheless, 
heat maps relax the constraint by which only elements from the 
same frame can be compared; in practice this is accomplished 
through the thermal diffusion process. At the same time, heat 
maps also expose the history of their respective trajectories, 
allowing the metric to capture the temporal aspect of motion 
similarity. Proxemics, DTW and Granger causality would rate 
two pedestrians meeting and parting ways analogously, even if 
the former case is more likely to represent a group formation 
process. Recognizing that motion trajectories also encode temporal 
information is a great advantage of heat-map-based analysis. 

Fig. 6. Intersecting heat maps are generated by converging trajectories, 
which project on the xy plane their shared goal. 

6 Learning Framework 

The linear parametrization of the affinity matrix $W^d$ of 
Eq. (2) guarantees a partition of the crowd which 
is consistent with the social groups properties. The parameters 
$\mathbf{w} = [\alpha, \beta]$ govern both the importance of each feature 
alone and their similarity/dissimilarity optimal combinations, 
resulting in different clustering rules. 

The choice of the best rule should account for all factors 
affecting the group formation process, such as environmental 
constraints or cultural influences. The complexity of explicitly 
evaluating these factors resides in the impossibility to directly 
observe them. Still, we can gain important insights by observing 
the grouping process. On these premises, we adopt a learning 
framework capable of choosing the most suitable clustering 
rule by finding a set of feature weights that implicitly embodies 
these non-observable aspects. 


6.1 Supervised CC Through Structured Learning 

Let us consider the input $\mathbf{x}_i = \{[1 - d^i(a, b);\ d^i(a, b)]\}_{a \neq b}$ to be 
the set of pairwise features computed on all the possible pairs 
of trajectories $T_a$ and $T_b$ in the $i$-th temporal window and $\mathbf{y}_i$ 
the clustering solution, i.e. the set of all social groups appearing 
in the crowd $M_i$. Since $\mathbf{y}_i$ cannot be described by a single-valued 
function, we adopt the Structural SVM [29] framework 
to model and learn to predict the solution. The goal is to learn 
a classification mapping $f : \mathcal{X} \to \mathcal{Y}$ between the input space $\mathcal{X}$ 
and the structured output space $\mathcal{Y}$ given a set of input-output 
pairs $\{(\mathbf{x}_1, \mathbf{y}_1), \ldots, (\mathbf{x}_n, \mathbf{y}_n)\}$. A discriminant score function 
$F : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ is defined over the joint input-output space 
and $F(\mathbf{x}, \mathbf{y})$ can be interpreted as measuring the compatibility 
of $\mathbf{x}$ and $\mathbf{y}$. Now, the prediction function $f$ can be defined as 


$$ f(\mathbf{x}) = \arg\max_{\mathbf{y} \in \mathcal{Y}(\mathbf{x})} F(\mathbf{x}, \mathbf{y}) \tag{16} $$

where the maximizer over the label space $\mathcal{Y}(\mathbf{x})$ is the predicted 
label, i.e. the solution of the inference problem. For simplicity 
we choose to restrict the space of $F$ to linear functions over 
some combined feature representation $\Psi(\mathbf{x}, \mathbf{y})$ subject to a $\mathbf{w}$ 
parametrization. This feature mapping cannot be defined out 
of the context of the problem, as it is the problem itself that 
specifies, given a particular input, the nature of the desired 
solution. Following the definition of correlation clustering 
in Eq. (1) and its parametrization introduced in Eq. (2), the 
compatibility of an input-output pair is directly described as 


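$$ F(\mathbf{x}, \mathbf{y}; \mathbf{w}) = \mathbf{w}^T \Psi(\mathbf{x}, \mathbf{y}) = \mathbf{w}^T \sum_{y \in \mathbf{y}} \sum_{a \neq b \in y} \mathbf{x}_{ab} \tag{17} $$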
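
For very small crowds, the prediction rule of Eq. (16) with the linear score of Eq. (17) can be evaluated by exhaustive enumeration, as in the Python sketch below. It is meant only as an illustration: the helper names `partitions`, `joint_feature` and `predict` and the toy features are invented, and the weight vector is assumed to stack $\alpha$ and $-\beta$ so that $\mathbf{w}^T \mathbf{x}_{ab}$ reproduces the affinity of Eq. (2).

```python
from itertools import combinations


def partitions(items):
    """Yield every partition of a list as a list of blocks (brute force)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or into a new block of its own.
        yield [[first]] + part


def joint_feature(x, partition):
    """Psi(x, y): sum of the pairwise vectors x_ab over pairs sharing a block."""
    dim = len(next(iter(x.values())))
    psi = [0.0] * dim
    for block in partition:
        for a, b in combinations(sorted(block), 2):
            psi = [p + v for p, v in zip(psi, x[(a, b)])]
    return psi


def predict(w, x, items):
    """Eq. (16): the partition with the highest score w^T Psi(x, y)."""
    def score(y):
        return sum(wk * pk for wk, pk in zip(w, joint_feature(x, y)))
    return max(partitions(items), key=score)


# Toy input: x_ab = [1 - d(a, b), d(a, b)] with a single distance per pair.
x = {("a", "b"): [0.9, 0.1], ("a", "c"): [0.1, 0.9], ("b", "c"): [0.2, 0.8]}
w = [1.0, -1.0]  # assumption: w = [alpha, -beta] with alpha = beta = 1
print(predict(w, x, ["a", "b", "c"]))
```

Because the score is linear in $\mathbf{w}$, changing the weights changes the clustering rule without changing the inference procedure. The enumeration is only viable for a handful of pedestrians.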


The problem of learning in structured and interdependent output 
spaces can be formulated as a maximum-margin problem. 
We adopt the n-slack, margin-rescaling formulation: 


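$$ \begin{aligned} \min_{\mathbf{w}, \xi} \quad & \frac{1}{2} \|\mathbf{w}\|^2 + \frac{C}{n} \sum_{i=1}^{n} \xi_i \\ \text{s.t.} \quad & \forall i : \xi_i \geq 0, \\ & \forall i, \forall \mathbf{y} \in \mathcal{Y}(\mathbf{x}_i) \setminus \mathbf{y}_i : \mathbf{w}^T \delta\Psi_i(\mathbf{y}) \geq \Delta(\mathbf{y}_i, \mathbf{y}) - \xi_i \end{aligned} \tag{18} $$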

where $\delta\Psi_i(\mathbf{y}) = \Psi(\mathbf{x}_i, \mathbf{y}_i) - \Psi(\mathbf{x}_i, \mathbf{y})$, $\xi_i$ are the slack 
variables introduced in order to accommodate margin 
violations, $\Delta(\mathbf{y}_i, \mathbf{y})$ is the loss function further defined in 
Sec. 6.3 and $C$ is the regularization trade-off. Intuitively, we 
want to maximize the margin and jointly guarantee that, for a 
given input, every possible output is considered worse 
than the correct one by at least a margin of $\Delta(\mathbf{y}_i, \mathbf{y}) - \xi_i$, 
where $\Delta(\mathbf{y}_i, \mathbf{y})$ is bigger when the two predictions are known 
to be more different. 

Remarkably, correlation clustering does not need to know in
advance how many groups are present in the scene. Moreover,
a positive overall cluster score can group two elements even
if their pairwise affinity is negative, implicitly modeling the
transitive property of relationships in groups, as stated in Sec. 3.
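The transitivity effect can be checked on a tiny worked example. The sketch below enumerates every partition of three pedestrians and picks the one with the highest sum of intra-cluster affinities; the affinity values are hypothetical and chosen so that a and c mildly repel each other while both are strongly tied to b. Exhaustive enumeration is used only because the example is tiny; it is not the inference procedure of the paper.

```python
from itertools import combinations


def partitions(items):
    """Yield every partition of `items` as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        # Put `first` into each existing block in turn ...
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        # ... or let it start a block of its own.
        yield [[first]] + part


def cluster_score(part, affinity):
    """Sum of the pairwise affinities of elements that share a block."""
    return sum(affinity[frozenset(pair)]
               for block in part for pair in combinations(block, 2))


# Hypothetical affinities: a-b and b-c attract, a-c mildly repel.
affinity = {
    frozenset("ab"): 2.0,
    frozenset("bc"): 2.0,
    frozenset("ac"): -1.0,
}
all_parts = list(partitions(["a", "b", "c"]))
best = max(all_parts, key=lambda p: cluster_score(p, affinity))
print(len(all_parts), best, cluster_score(best, affinity))
```

The best partition puts all three pedestrians in one group even though the a-c affinity is negative, and the number of groups is an outcome of the maximization rather than an input.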

6.2 Batch Sequential Optimization 

The quadratic program (QP) (18) introduces a constraint for
every possible wrong clustering of the n examples, more
precisely $\sum_{i=1}^{n} (|\mathcal{Y}(\mathbf{x}_i)| - 1)$. Unfortunately, the number of
ways to partition a set $M$ grows more than exponentially with
the number of items, according to the Bell sequence [?]

$$B_{|M|} = \sum_{i=0}^{|M|} \frac{1}{i!} \sum_{j=0}^{i} (-1)^{i-j} \binom{i}{j} j^{|M|} \qquad (19)$$

making the optimization intractable. As an example, for a
crowd composed of 20 pedestrians the number of potential
solutions would be about $5.2 \cdot 10^{13}$. In order to deal with this
high number of constraints many approximation schemes have
been proposed, among which cutting-plane algorithms and subgradient
methods are the most commonly used. In particular, all
the constraints of QP (18) can be replaced by n piecewise-linear
ones by defining the structured hinge-loss:

$$\tilde{H}(\mathbf{x}_i) = \max_{\mathbf{y} \in \mathcal{Y}} \Delta(\mathbf{y}_i, \mathbf{y}) - \mathbf{w}^T \delta\Psi_i(\mathbf{y}) \qquad (20)$$

The computation of the structured hinge-loss for each element
i of the training set, described in Sec. 6.4, amounts to finding
the most "violating" output $\mathbf{y}$ for a given input $\mathbf{x}_i$ and its
correct associated output $\mathbf{y}_i$. We only have n constraints of
the form $\xi_i \geq \tilde{H}(\mathbf{x}_i)$ and the non-smooth version of QP (18)
reduces to

$$\min_{\mathbf{w}} \; \frac{1}{2}\|\mathbf{w}\|^2 + \frac{C}{n}\sum_{i=1}^{n} \tilde{H}(\mathbf{x}_i) \qquad (21)$$

Given a maximization oracle, i.e. a solver for Eq. (20),
and a computed solution $\mathbf{y}^*$, subgradient methods can easily
be applied to QP (21), since $\partial_{\mathbf{w}} \tilde{H}(\mathbf{x}_i) = -\delta\Psi_i(\mathbf{y}^*)$.
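To make the combinatorial growth behind Eq. (19) concrete, the following short Python sketch computes Bell numbers. It uses the standard recurrence B_{m+1} = sum_k C(m, k) B_k rather than the closed form of Eq. (19); the two give the same values, and the function name `bell` is just an illustrative choice.

```python
from math import comb


def bell(n):
    """Number of ways to partition a set of n elements (Bell number B_n)."""
    b = [1]  # B_0 = 1: the empty set has exactly one partition
    for m in range(n):
        # B_{m+1} = sum_{k=0}^{m} C(m, k) * B_k
        b.append(sum(comb(m, k) * b[k] for k in range(m + 1)))
    return b[n]


for size in (5, 10, 19, 20):
    print(size, bell(size))
```

The recurrence gives B_19 = 5,832,742,205,057 (about 5.8 · 10^12) and B_20 = 51,724,158,235,372 (about 5.2 · 10^13), so even a small crowd rules out enumerating the constraints of QP (18).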

To exploit the domain separability of the constraints and
limit the number of oracle calls needed to converge to the
optimal solution, we adopt a Block-Coordinate
version of the Frank-Wolfe algorithm (BCFW) [42], outlined
in Alg. 1. The algorithm works by minimizing the QP objective
function, but restricted to a single random example
at each iteration. By calling the max oracle on the selected
training sample (line 4) we obtain a new sub-optimal parameter
set $\mathbf{w}_s$ by simple derivation (line 5). The best update is then
found through a closed-form line search (line 6), greatly
reducing convergence time compared to other subgradient
methods.

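In order to solve the QP effectively, it is important to choose
an appropriate loss function, as the learning ability of Structural
SVM highly depends on it. In Sec. 6.3 we introduce and
discuss different potential loss functions and their respective
descriptive ability. Given the loss function, in Sec. 6.4 an
efficient method to compute the maximization oracle (line 4
of Alg. 1) is described.

Algorithm 1 Block-Coordinate Frank-Wolfe Algorithm
1: Let $\mathbf{w}^{(0)}, \mathbf{w}_i^{(0)} := 0$ and $\ell^{(0)}, \ell_i^{(0)} := 0$
2: for it := 0 to maxIterations do
3:   Pick i at random in {1, ..., n}
4:   Solve $\mathbf{y}^* := \arg\max_{\mathbf{y} \in \mathcal{Y}} \Delta(\mathbf{y}_i, \mathbf{y}) - \mathbf{w}^T \delta\Psi_i(\mathbf{y})$
5:   Let $\mathbf{w}_s := \frac{C}{n} \delta\Psi_i(\mathbf{y}^*)$ and $\ell_s := \frac{C}{n} \Delta(\mathbf{y}_i, \mathbf{y}^*)$
6:   Let $\gamma := \frac{(\mathbf{w}_i^{(it)} - \mathbf{w}_s)^T \mathbf{w}^{(it)} + \ell_s - \ell_i^{(it)}}{\|\mathbf{w}_i^{(it)} - \mathbf{w}_s\|^2}$ and clip to [0, 1]
7:   Update $\mathbf{w}_i^{(it+1)} := (1 - \gamma)\mathbf{w}_i^{(it)} + \gamma \mathbf{w}_s$
     and $\ell_i^{(it+1)} := (1 - \gamma)\ell_i^{(it)} + \gamma \ell_s$
8:   Update $\mathbf{w}^{(it+1)} := \mathbf{w}^{(it)} + \mathbf{w}_i^{(it+1)} - \mathbf{w}_i^{(it)}$
     and $\ell^{(it+1)} := \ell^{(it)} + \ell_i^{(it+1)} - \ell_i^{(it)}$
9: end for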
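The block-coordinate update can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the objective (1/2)||w||^2 + (C/n) sum_i H(x_i), takes the oracle as a user-supplied function, and omits the global loss accumulator because it is only needed to monitor the duality gap. To keep the example runnable, the oracle is exact and the task is a made-up two-class problem instead of group detection; `bcfw`, `psi` and `oracle` are hypothetical names.

```python
import random


def bcfw(n, dim, oracle, C=10.0, max_iterations=200, seed=0):
    """Block-coordinate Frank-Wolfe for a structural SVM.

    oracle(i, w) returns (delta_psi, loss) for the most violating output y*
    of example i: delta_psi = Psi(x_i, y_i) - Psi(x_i, y*), loss = Delta(y_i, y*).
    """
    rng = random.Random(seed)
    w = [0.0] * dim
    w_blocks = [[0.0] * dim for _ in range(n)]
    l_blocks = [0.0] * n
    for _ in range(max_iterations):
        i = rng.randrange(n)
        delta_psi, loss = oracle(i, w)
        w_s = [C / n * v for v in delta_psi]
        l_s = C / n * loss
        diff = [a - b for a, b in zip(w_blocks[i], w_s)]
        denom = sum(v * v for v in diff)
        if denom == 0.0:
            continue  # the block already coincides with the oracle's corner
        # Closed-form line search, clipped to [0, 1].
        gamma = (sum(a * b for a, b in zip(diff, w)) + l_s - l_blocks[i]) / denom
        gamma = min(1.0, max(0.0, gamma))
        new_block = [(1 - gamma) * a + gamma * b
                     for a, b in zip(w_blocks[i], w_s)]
        w = [wv + nb - ob for wv, nb, ob in zip(w, new_block, w_blocks[i])]
        w_blocks[i] = new_block
        l_blocks[i] = (1 - gamma) * l_blocks[i] + gamma * l_s
    return w


# Toy two-class problem standing in for the structured task (made-up data).
X = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
Y = [0, 0, 1, 1]
K, D = 2, 2


def psi(x, y):
    """Joint feature map: copy x into the block that belongs to label y."""
    out = [0.0] * (K * D)
    out[y * D:(y + 1) * D] = x
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def oracle(i, w):
    """Exact loss-augmented argmax obtained by enumerating the K labels."""
    best = None
    for y in range(K):
        d_psi = [a - b for a, b in zip(psi(X[i], Y[i]), psi(X[i], y))]
        loss = 0.0 if y == Y[i] else 1.0
        value = loss - dot(w, d_psi)
        if best is None or value > best[0]:
            best = (value, d_psi, loss)
    return best[1], best[2]


w = bcfw(len(X), K * D, oracle)
pred = [max(range(K), key=lambda y: dot(w, psi(x, y))) for x in X]
print(pred)
```

For group detection, the same loop would be driven by the approximate oracle of Sec. 6.4, with the joint feature map of Eq. (17) and the loss of Sec. 6.3 in place of the toy ones.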


6.3 Loss Function and Scoring Procedure 

One common choice of loss function for clustering is the
pairwise loss $\Delta_{PW}(\mathbf{y}_i, \mathbf{y})$, which is a generalization of the
Rand coefficient [43] and is defined as the ratio between the
number of pairs on which $\mathbf{y}_i$ and $\mathbf{y}$ disagree on their cluster
membership and the number of all possible pairs of elements in
the set. Due to the quadratic number of connections that exist
among crowd members, this measure tends to be imprecise
when dealing with large crowds: as the crowd grows, the
number of positive links connecting group members becomes
negligible with respect to the total number of links. As a
consequence, erroneous solutions are not strongly penalized.
The MITRE loss [44], $\Delta_M(\mathbf{y}_i, \mathbf{y})$, founded on the understanding
that connected components are sufficient to describe groups,
partially mitigates this problem by representing groups as
spanning trees instead of complete graphs, inducing a linear
(and not quadratic, as in the pairwise case) number of both
positive and negative links among members. For any crowd
partitioning, a spanning forest is an equivalence class, as many
trees that describe the same group configuration may exist. The
final score is obtained by accounting for the number of links that
need to be removed or added to recover a spanning forest of
the correct solution. Nonetheless, problems arise when working
on relations and not directly on members, as singletons have
no connections at all but should still be counted positively
when correctly classified.
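The dilution of the pairwise loss in large crowds is easy to verify numerically. The sketch below implements the pairwise loss as described (fraction of pairs on which the two solutions disagree about co-membership) and evaluates the same mistake, a single missed two-person group, in crowds of different sizes. The dictionary representation and the helper `missed_pair_loss` are assumptions made for this illustration.

```python
from itertools import combinations


def pairwise_loss(truth, pred):
    """Fraction of pedestrian pairs whose same-group status disagrees.

    truth and pred map each pedestrian to a group label.
    """
    pairs = list(combinations(sorted(truth), 2))
    disagree = sum((truth[a] == truth[b]) != (pred[a] == pred[b])
                   for a, b in pairs)
    return disagree / len(pairs)


def missed_pair_loss(n):
    """Loss when the only group (pedestrians 0 and 1) is predicted as split."""
    truth = {p: p for p in range(n)}
    truth[1] = 0                      # pedestrians 0 and 1 walk together
    pred = {p: p for p in range(n)}   # every pedestrian predicted alone
    return pairwise_loss(truth, pred)


for n in (4, 40, 400):
    print(n, missed_pair_loss(n))
```

The same error costs 1/6 in a crowd of 4 but only 1/780 in a crowd of 40, since the denominator grows quadratically while the number of wrong links stays fixed.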

For this reason, we propose a loss function, the GROUP-MITRE
loss (G-MITRE) $\Delta_{GM}(\mathbf{y}_i, \mathbf{y})$, that overcomes this
limitation by adding, for each pedestrian described by the
trajectory $T_a$, a fake counterpart to which only singletons
are connected. With this expedient, singletons can also be
taken into consideration when computing the
discrepancy between two solutions. The particular design
choice of linking only singleton members to the fake counterparts
generates two discrepancies when committing errors involving
singletons and is thus a further effort in generating more
plausible hierarchical groups in the solution, as depicted in
Fig. 7.

More formally, consider two clustering solutions $\mathbf{y}_i$ and
$\mathbf{y}$ and a representative of their respective spanning forests $Q$
and $R$. The connected components of $Q$ and $R$ are identified
respectively by the sets of trees $Q_1, Q_2, \dots$ and $R_1, R_2, \dots$
Note that if the number of elements in $Q_j$ is $|Q_j|$, then only
$c(Q_j) = |Q_j| - 1$ links are needed in order to create a spanning
tree. Let us define $\pi_R(Q_j)$ as the partition of a tree $Q_j$ with
respect to the forest $R$, that is the set of subtrees obtained by
considering only the membership relations in $Q_j$ also found
in $R$. If $R$ partitions $Q_j$ in $|\pi_R(Q_j)|$ subtrees, then
$v(Q_j) = |\pi_R(Q_j)| - 1$ links are sufficient to restore the original
tree. It follows that the recall error for $Q_j$ can be computed as
the number of missing links divided by the minimum number
of links needed to create that spanning tree. Accounting for
all trees $Q_j$, the global recall measure of $Q$ is:

$$\mathcal{R}_Q = 1 - \frac{\sum_j v(Q_j)}{\sum_j c(Q_j)} = \frac{\sum_j \left( |Q_j| - |\pi_R(Q_j)| \right)}{\sum_j \left( |Q_j| - 1 \right)} \qquad (22)$$

The precision of $Q$ (recall of $R$) can be computed by
exchanging $Q$ and $R$. Given the definitions of precision and recall,
and employing the standard F-score $F_1$, the loss is defined as

$$\Delta_{GM} = 1 - F_1 \qquad (23)$$

The complete algorithm for the computation of the G-MITRE
loss is reported in Alg. 2. We employ disjoint-set arrays
due to the efficiency of checking whether two pedestrians
belong to the same group. Recall that UNION and FIND are
the standard functions defined over disjoint-set arrays and
denote the operations to merge two clusters and to find an
element's membership, respectively. In the pseudo-code we use
the notation $\mathbf{y}_i/\mathbf{y}$ to indicate that the algorithm first works on
the solution $\mathbf{y}_i$ and then analogously on $\mathbf{y}$.

Algorithm 2 G-MITRE loss $\Delta_{GM}(\mathbf{y}_i, \mathbf{y})$ computation
Require: $\mathbf{y}_i$ and $\mathbf{y}$ as disjoint-set data structures
1: $\varphi(x)$ are the unique roots of connected components $x$
2: $\tau(x)$ is the size of the connected component with root $x$
3: for all $T \in \mathbf{y}_i/\mathbf{y}$ do
4:   $\mathbf{y}_i/\mathbf{y} = \mathbf{y}_i/\mathbf{y} \cup a_T$
5:   if $\tau(\text{FIND}(\mathbf{y}_i/\mathbf{y}(T))) = 1$ then
6:     UNION($\mathbf{y}_i/\mathbf{y}(T)$, $\mathbf{y}_i/\mathbf{y}(a_T)$)
7:   end if
8: end for
9: for all $q \in \varphi(\mathbf{y}_i/\mathbf{y})$ do
10:   $v_{\mathbf{y}_i/\mathbf{y}}$ += $|\varphi(\bigcup_{\text{FIND}(\mathbf{y}_i/\mathbf{y}(T)) = q} \mathbf{y}/\mathbf{y}_i(T))| - 1$
11:   $c_{\mathbf{y}_i/\mathbf{y}}$ += $\tau(q) - 1$
12: end for
13: $\mathcal{R}_{\mathbf{y}_i/\mathbf{y}} = 1 - v_{\mathbf{y}_i/\mathbf{y}} / c_{\mathbf{y}_i/\mathbf{y}}$
14: $\Delta(\mathbf{y}_i, \mathbf{y}) = 1 - 2 \mathcal{R}_{\mathbf{y}_i} \mathcal{R}_{\mathbf{y}} / (\mathcal{R}_{\mathbf{y}_i} + \mathcal{R}_{\mathbf{y}})$

Fig. 7. Differences in the way losses account for errors. Singletons are
white. Figures (a, c, e) depict solution $\mathbf{y}_i$ and the links considered by the
respective losses, while (b, d, f) color pedestrians according to solution $\mathbf{y}$
and show the links on which the two solutions $\mathbf{y}_i$ and $\mathbf{y}$ disagree.
Panels: (a) $\mathbf{y}_i$ pairwise links; (b) $\mathbf{y}$, $\Delta_{PW}(\mathbf{y}_i, \mathbf{y}) = 0.27$; (c) MITRE links;
(e) $\mathbf{y}_i$ G-MITRE links; (f) $\mathbf{y}$, $\Delta_{GM}(\mathbf{y}_i, \mathbf{y}) = 0.75$.


6.4 Approximate Oracle

Despite the simplicity of the algorithm, the intrinsic complexity
of the optimization is hidden in the search for the most violating
solution $\mathbf{y}^*$ for the i-th example (line 4 of Alg. 1): finding the
most violated constraint requires solving the loss-augmented
decoding subproblem. Note that the original prediction problem
of Eq. (16) is NP-hard and the insertion of a non-linear loss
in the computation of the maximum is not likely to help.
Nevertheless, thanks to its iterative nature, the inference scheme
introduced in Sec. 4 can be adapted to approximate the oracle
as well. Starting from the trivial solution having each pedestrian
of the i-th example in its own cluster, the algorithm repeatedly
merges the two clusters which yield the highest increment
in the structured hinge-loss $\tilde{H}(\mathbf{x}_i)$ of Eq. (20), until a local
maximum is found.

Of course, by following a greedy procedure, there is no
guarantee of selecting the most violated constraint. Interestingly
enough, Lacoste-Julien et al. [42] show that all convergence
results known for exact maximizers of the loss-augmented
problem also hold for approximate maximizers, by allowing
the algorithm to iterate longer toward convergence. For further
details, please refer to their original work.
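The greedy loss-augmented search of Sec. 6.4 can be sketched together with the G-MITRE loss of Sec. 6.3. The Python code below is a simplified illustration rather than the authors' implementation. Group memberships are plain dictionaries instead of disjoint-set arrays. The pairwise affinities W_ab are given directly as a hypothetical dictionary instead of being computed from w and the features. Every candidate merge is re-scored from scratch, which is only practical for small examples. Because the score of the ground truth is constant, maximizing the hinge-loss of Eq. (20) is equivalent to maximizing loss plus score of the candidate, which is what `greedy_oracle` does.

```python
from collections import Counter
from itertools import combinations


def _augment(labels):
    """Map every pedestrian and its fake counterpart to a component id.

    Only singletons share a component with their own fake counterpart.
    """
    sizes = Counter(labels.values())
    comp = {}
    for p, g in labels.items():
        comp[p] = ("grp", g)
        comp[("fake", p)] = ("grp", g) if sizes[g] == 1 else ("fake", p)
    return comp


def _recall(q, r):
    """MITRE recall of forest q with respect to forest r (cf. Eq. 22)."""
    trees = {}
    for node, c in q.items():
        trees.setdefault(c, []).append(node)
    missing = needed = 0
    for members in trees.values():
        missing += len({r[m] for m in members}) - 1  # links lost in r
        needed += len(members) - 1                   # links of a spanning tree
    return 1.0 - missing / needed


def g_mitre_loss(truth, pred):
    """G-MITRE loss 1 - F1 between two labelings of the same pedestrians."""
    q, r = _augment(truth), _augment(pred)
    rec, prec = _recall(q, r), _recall(r, q)
    if rec + prec == 0.0:
        return 1.0
    return 1.0 - 2.0 * rec * prec / (rec + prec)


def greedy_oracle(truth, affinity):
    """Greedy merging that maximizes loss(truth, y) + score(y)."""
    def objective(clusters):
        pred = {p: k for k, c in enumerate(clusters) for p in c}
        score = sum(affinity[frozenset(pair)]
                    for c in clusters for pair in combinations(c, 2))
        return g_mitre_loss(truth, pred) + score

    clusters = [[p] for p in sorted(truth)]  # start from all singletons
    best = objective(clusters)
    while len(clusters) > 1:
        candidates = []
        for i, j in combinations(range(len(clusters)), 2):
            merged = [c for k, c in enumerate(clusters) if k not in (i, j)]
            merged.append(clusters[i] + clusters[j])
            candidates.append((objective(merged), merged))
        value, merged = max(candidates, key=lambda t: t[0])
        if value <= best:
            break  # local maximum reached
        best, clusters = value, merged
    return clusters, best


# Toy example: a and b form a group, c walks alone (affinities are made up).
truth = {"a": 0, "b": 0, "c": 1}
affinity = {frozenset("ab"): -1.0, frozenset("ac"): 2.0, frozenset("bc"): -3.0}
print(g_mitre_loss(truth, {"a": 0, "b": 1, "c": 2}))
print(greedy_oracle(truth, affinity))
```

In this toy case the oracle returns the clustering that wrongly groups a with c: it has both a high score under the current (bad) affinities and the maximum loss, which makes it the most violating output for the update of Alg. 1.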
TABLE 2

Comparative results on publicly available datasets using the G-MITRE loss $\Delta_{GM}$ of Sec. 6.3 and the positive pairwise loss $\Delta_{PW}$ of [26]. Each cell reports precision P / recall R.

dataset          loss   our method                 baseline                   [?]           [?]           [?]           [?]
BIWI hotel       GM     97.3 ± 0.7 / 97.7 ± 1.5    71.0 ± 8.1 / 69.6 ± 7.4    89.2 / 90.9   84.0 / 51.2   -             67.3 / 64.1
BIWI hotel       PW     89.1 ± 1.2 / 91.9 ± 1.5    47.6 ± 9.2 / 88.6 ± 8.6    88.9 / 89.3   83.7 / 93.9   81.0 / 91.0   51.5 / 90.4
BIWI eth         GM     91S ± 1.2 / 94.2 ± 0.3     72.4 ± 4.4 / 65.2 ± 3.4    87.0 / 84.2   60.6 / 76.4   -             69.3 / 68.2
BIWI eth         PW     91.1 ± 0.4 / 83.4 ± 0.6    39.1 ± 8.4 / 91.2 ± 1.7    80.7 / 80.7   72.9 / 78.0   79.0 / 82.0   44.5 / 87.0
CBE student003   GM     81.7 ± 0.2 / 82.5 ± 0.2    59.9 ± 2.9 / 53.5 ± 6.8    77.2 / 73.6   56.7 / 76.0   -             40.4 / 48.6
CBE student003   PW     82.3 ± 0.3 / 74.1 ± 0.2    24.0 ± 9.7 / 49.3 ± 12.9   72.2 / 65.1   63.9 / 72.6   70.0 / 74.0   10.6 / 76.0


7 Experimental Results 


We designed several experiments to evaluate the algorithm
behavior on well-assessed benchmarks and its connections to
the nature of the problem. All the experiments were carried
out on ground truth trajectory data, except for Sec. 7.4 where
the method is evaluated on tracklets extracted by a modern
detector/tracker system. We also propose new video sequences
to stress the algorithm over a variety of challenges in real
world scenarios. Since the method works on ground plane
(metric) data, we also provide homography information for all
the employed sequences.


Datasets 

We selected two publicly available datasets, namely the BIWI
Walking Pedestrians dataset [45] and the Crowds-By-Examples
(CBE) dataset [46]. The former records two sparsely crowded
scenes, outside a university and at a bus stop (eth and hotel
in Tab. 3). The CBE dataset records a medium density crowd
outside another university (student003, briefly stu003),
providing some challenges: the density of pedestrians is
significantly high and there are multiple entry and
exit points. While BIWI and CBE are standard datasets in
crowd analysis, we also use the more recent Vittorio Emanuele
II Gallery (VEIIG) dataset [47], from which we extracted
a five-minute subsequence, gall, particularly interesting
due to the fast and continuous change in crowd density.
We also propose a new dataset to cope with the increasing
variety of applications in dense-crowd management, MPT-20x100,
composed of 20 sequences of 100 frames where we manually
annotated trajectories and social groups. The dataset comprises
different videos [48], all characterized by a high number of
pedestrians and a heterogeneous set of scene conditions,
varying in density, scale, viewpoint and type of interactions,
like walking in a mall, crossing the street or participating in
public events.

In Tab. 3 we report some measures useful to characterize the
spatial complexity of the datasets:

• $d_{in}$ is the group compactness, computed as the mean
distance between members of the same group;

• $d_{out}$ is the group isolation, i.e. the mean distance between
each member and its closest unrelated pedestrian;

• the ratio $d_{i/o} = d_{in}/d_{out}$ measures crowd collectiveness:
small values mean compact groups in a sparse crowd.
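The three measures can be sketched in Python as follows. This is a minimal illustration computed on a single snapshot of ground-plane positions. Two points are assumptions of the sketch, not details given in the text: d_out is averaged over pedestrians that belong to a group of at least two people, and the function expects at least one such group and at least one unrelated pedestrian. The name `density_metrics` is hypothetical.

```python
from collections import Counter
from math import dist


def density_metrics(pos, group):
    """Return (d_in, d_out, d_in / d_out) for one snapshot.

    pos maps each pedestrian to its (x, y) position in metres;
    group maps each pedestrian to a group label.
    """
    peds = sorted(pos)
    sizes = Counter(group.values())
    # Compactness: distances between members of the same group.
    intra = [dist(pos[a], pos[b])
             for i, a in enumerate(peds) for b in peds[i + 1:]
             if group[a] == group[b]]
    # Isolation: distance from each group member to its closest outsider.
    nearest = [min(dist(pos[a], pos[b]) for b in peds if group[b] != group[a])
               for a in peds if sizes[group[a]] > 1]
    d_in = sum(intra) / len(intra)
    d_out = sum(nearest) / len(nearest)
    return d_in, d_out, d_in / d_out


# Toy snapshot: a and b walk together, c is an unrelated pedestrian.
pos = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (4.0, 0.0)}
group = {"a": 0, "b": 0, "c": 1}
print(density_metrics(pos, group))
```

Here the pair is 1 m apart and its members are 4 m and 3 m away from the outsider, so the ratio is small, the signature of a compact group in a sparse scene.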


TABLE 3

Datasets: number of pedestrians (#p), groups (#g) and density metrics.

              #p    #g    d_in (m)   d_out (m)   d_i/o
student003    406   108   0.41       0.70        0.59
eth           117   18    0.99       2.79        0.35
hotel         107   11    0.75       2.00        0.38
gall          630   207   0.77       1.66        0.46
MPT-20x100    82    10    0.63       1.45        0.48


Evaluation Scheme 

There is no consensus on which metrics should be used to
evaluate group correctness: we propose to use the G-MITRE
precision $\mathcal{P}$ and recall $\mathcal{R}$, since they account for the correct
classification of singletons as well. This is an important gain,
as in crowded scenes the number of people walking alone
is rarely negligible. Each measure is reported in terms of
mean and standard deviation over 5 runs to account for the
stochastic nature of the training of our algorithm. Where not
differently specified, we used 100 s for training and a 10 s
sliding window with no overlap for feature computation. The
regularization parameter C of QP (18) is fixed to 10.


For the heat-map based feature of Sec. 5.3 we run a grid
search on the parameters. For all the experiments, the length
of the cell edge is fixed to 30 cm, $k_8 = 10^{-5}$ and $k_r = 0.5$.


7.1 Baseline and Benchmark Comparisons 

We compare our method with three recent state-of-the-art group
detection algorithms, namely [?], [24], [26], selected on
the basis of their reported performance on public datasets and
availability of code. In addition, we devised a simple baseline
version of our solution that performs the group partitioning
with no use of the learning framework. The weights are
randomly chosen to be the same for all the features, so that
the randomness resides in the similarity/dissimilarity ratio.


7.1.1 Quantitative Results 

Quantitative results are given in Tab. 2. To highlight the
advantage of our algorithm, results are presented both in terms of
G-MITRE and of a pairwise loss accounting only for positive
(intra-group) relations but neglecting singletons, $\Delta_{PW}$ [26].
Although the latter loss is not directly optimized by our algorithm, our
method outperforms the competitors on all the tested sequences.
TABLE 4

Evaluation of our proposal when trained with different loss functions.

          Pairwise loss                MITRE loss
          P            R               P            R
hotel     90.1 ± 2.0   84.1 ± 3.2      89.2 ± 3.0   93.2 ± 1.9
eth       88.7 ± 1.8   87.3 ± 2.6      91.9 ± 0.8   92.9 ± 1.0
stu003    68.9 ± 1.4   69.9 ± 1.5      80.1 ± 2.4   80.9 ± 2.3



This can be explained by the ability of our algorithm to
adapt the concept of group to ever-changing scenarios by
varying the feature importance, and by the use of sociologically
inspired similarity functions. The slightly lower performance
on the stu003 sequence is due to the high complexity of
the scene: the high value of the $d_{i/o}$ ratio in Tab. 3 suggests
the presence of loose groups in a dense crowd which, as such,
are challenging to detect.

7.1.2 Evaluation of Different Loss Functions 

As structured learning relies upon a definition of what is wrong
in order to learn how to classify well, the choice of the loss function can
greatly affect the final performance. By fixing the G-MITRE
measure as the scoring scheme, we quantitatively test
the influence of the choice of the loss on the eth, hotel
and stu003 datasets (Tab. 4). As could be expected from its
definition, the improvement due to the use of the G-MITRE loss
(reported in Tab. 2) is greater in the eth and hotel sequences,
where the ratio between the number of singletons and the people
walking in groups is higher and, as such, learning to classify
them as well becomes crucial. More interestingly, we observe
that the pairwise loss obtains outstanding performance when
the number of pedestrians is limited, but becomes ineffective
when it starts to grow, as in stu003.

7.2 Feature Weight Learning on MPT-20x100

The CBE and BIWI datasets expose some interesting challenges of
the problem but, with the only exception of the stu003 sequence,
they have a limited number of pedestrians in the scene and a
low crowd density. Moreover, the scenarios are similar and
the variety of interactions underlying group formation is
limited. The proposed MPT-20x100 dataset, on the other hand,
presents different degrees of complexity.

First, we evaluate the general performance of the algorithm
and compare it with both our baseline and the proposal in [2]
where, for the latter method, we manually tuned the thresholds
to achieve the best results. These methods are clustering based and
partially consistent with the social group axioms, but no learning
is employed. Results are shown in Fig. 8 as a survival curve plot,
which reveals on how many sequences the algorithms were
able to reach at least a given lower-bound performance;
per-video scores are in Fig. 9. Interestingly, the difference
between our method and [2] increases here, with respect to
the previous datasets, by 10% on average, suggesting that
sequences can be really different in the concept of groups they
embed and thus learning is mandatory to adapt to these new
representations of social groups and keep performance stable.



Fig. 8. Comparison against baseline and [2] on MPT-20x100 (axis: number of videos).

Fig. 9. Results on MPT-20x100 highlight the complexity of each scene.


7.2.1 The Need for Learning from Examples 

The confusion-like matrix depicted in Fig. 10 presents
the F-1 scores obtained by training the algorithm on one
sequence of MPT-20x100 (row labels) and testing it on all
the other sequences (column labels). By reading the matrix
and averaging each row over all the columns, it is possible to
grasp how good a particular sequence was for training. At the
same time, by observing the average of each column over all
the rows, we can get an intuition of how well each sequence
was predicted by all the others.
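The two readings of the train/test matrix amount to row and column averages. The Python sketch below shows the computation on a small made-up matrix; the function name `train_test_summary` and the values are hypothetical, and the averages include the diagonal, following the description of averaging over all columns and all rows.

```python
def train_test_summary(scores):
    """Row and column means of a square train/test score matrix.

    scores[i][j] is the F-1 score obtained by training on sequence i
    and testing on sequence j.
    """
    n = len(scores)
    # How good sequence i is as training material: mean of row i.
    training_quality = [sum(row) / n for row in scores]
    # How well sequence j is predicted by the others: mean of column j.
    predictability = [sum(scores[i][j] for i in range(n)) / n
                      for j in range(n)]
    return training_quality, predictability


# Made-up 2 x 2 example.
scores = [[1.0, 0.5],
          [0.25, 0.75]]
print(train_test_summary(scores))
```

A sequence with a high row mean but a low column mean would be informative for training while being hard to predict from the other sequences.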

We are interested in understanding whether a specific notion 
of group is shared across sequences and how it is influenced 
by both scene elements (e.g. crowd density) and unobserved 
aspects (e.g. intentions and social hierarchies). 

With the purpose of capturing these invariants, we search for the
connected components of the matrix, using the F-1 score as the
affinity value among elements. Clustering is performed through
an asymmetric version of spectral clustering [49] based on the
Random Walk Laplacian defined as

$$L = A D^{-1} \qquad (24)$$

where $A$ is the affinity matrix defined as in Fig. 10 and $D$ is
the usual degree matrix. Following the eigen-gap heuristic we
found 4 distinct clusters in the MPT-20x100 dataset, highlighted
with black lines in Fig. 10. For every cluster we computed the
$d_{in}$, $d_{out}$ and $d_{i/o}$ spatial measures, displayed in Tab. 5, to verify
if clusters with a similar notion of group also share a common
configuration of distances among pedestrians, and possibly if
the performance is connected to crowd density.

Fig. 10. F-1 scores obtained by all combinations of train/test pair
sequences in MPT-20x100. Results were clustered (diagonal blocks C1-
C4 from left to right) to highlight similar notions of group among sequences.


TABLE 5

Spatial depiction, training efficacy and groups predictability of the
clusters of sequences of Fig. 10.

cluster   d_in (m)   d_out (m)   d_i/o   F1 train   F1 test
C1        0.58       1.03        0.54    0.82       0.82
C2        0.59       1.28        0.47    0.85       0.84
C3        0.59       0.99        0.59    0.77       0.64
C4        0.89       3.00        0.34    0.75       0.84

Tab. 5 also reports a measure of training efficacy ($F_1$ train),
computed as the mean accuracy obtained on the whole dataset
when only sequences in that specific cluster were used for
training and, analogously, a group predictability score ($F_1$ test),
i.e. the mean accuracy obtained on the sequences of that cluster
when all the sequences were used for training. They indicate
how useful a cluster is during training and how easy it is to
predict groups inside its sequences.

A first observation concerns cluster
C4, which presents the highest $F_1$ test and the lowest $F_1$ train.
We found it was easy to predict groups in these videos but
they were poorly informative as training examples, a result
justified by its small $d_{i/o}$. Nonetheless, clusters C1 and C3 exhibit
very similar $d_{i/o}$ ratios but perform very differently in terms
of both training efficacy and testing score, suggesting that a trivial
heuristic based on spatial information only is insufficient to
visually discern groups. Implicit aspects like motion constraints
or cultural and social context also affect the group formation
process, supporting our hypothesis that learning is needed to
adapt the concept of group to the current data.

7.2.2 Do we Capture the Essence of Being a Group? 

As previously stated, MPT-20x100 comprises very different
scenarios and situations and can provide important insights on
which are the most important elements that reveal groups. To
this end, recall from Sec. 4 that the weight vector
$\mathbf{w} = [\alpha, \beta] = [w_1, w_2, \dots, w_8]$ is such that the affinity
between two trajectories $T_a$ and $T_b$ can be written as:

$$\begin{aligned}
W^{ab} &= \alpha^T (\mathbf{1} - \mathbf{d}(a,b)) - \beta^T \mathbf{d}(a,b) \\
&= \underbrace{w_1 + w_2 + w_3 + w_4}_{\text{constant term}} - \underbrace{\left[ (w_5 + w_1) d_{ph} + (w_6 + w_2) d_{sh} + (w_7 + w_3) d_{ca} + (w_8 + w_4) d_{hm} \right]}_{(a,b)\text{-dependent term}}
\end{aligned} \qquad (25)$$

The contribution of each feature to the score, transformed
from a distance to an affinity measure by the constant term of
Eq. (25), is encoded in the absolute value of the coefficient of
the features themselves.

Fig. 11. Features normalized coefficients of Eq.
Legend: proxemics, DTW, causality, heat maps.
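The decomposition of Eq. (25) can be verified numerically. The Python sketch below evaluates the affinity both directly and through the constant and distance-dependent terms, then turns the per-feature coefficients into relative importances. The weights and distances are made up, and the L1 normalization of the coefficients is an assumption of this sketch, since the text does not specify how the coefficients shown in Fig. 11 are normalized.

```python
def affinity(w, d):
    """W_ab = alpha^T (1 - d) - beta^T d, with w = [alpha, beta]."""
    k = len(d)
    alpha, beta = w[:k], w[k:]
    return (sum(a * (1.0 - x) for a, x in zip(alpha, d))
            - sum(b * x for b, x in zip(beta, d)))


def decompose(w, k):
    """Constant term and per-feature coefficients (alpha_j + beta_j)."""
    alpha, beta = w[:k], w[k:]
    return sum(alpha), [a + b for a, b in zip(alpha, beta)]


def normalized_importance(w, k):
    """Absolute coefficients rescaled to sum to one (assumed normalization)."""
    _, coeffs = decompose(w, k)
    total = sum(abs(c) for c in coeffs)
    return [abs(c) / total for c in coeffs]


# Made-up weights [w1..w8] and distances [d_ph, d_sh, d_ca, d_hm].
w = [1.0, 2.0, 3.0, 4.0, 1.0, 1.0, 1.0, 1.0]
d = [0.5, 0.25, 0.5, 0.25]
const, coeffs = decompose(w, len(d))
direct = affinity(w, d)
via_terms = const - sum(c * x for c, x in zip(coeffs, d))
print(direct, via_terms, normalized_importance(w, len(d)))
```

Both evaluations return the same affinity, which confirms that a feature's influence on the score is governed by the sum of its two weights rather than by either weight alone.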


As shown in Fig. 11, the proxemics-inspired feature $d_{ph}$
dominates all the others, while the importance of the remaining
features varies greatly from sequence to sequence. The two
sequences 1manko3 (Fig. 15) and 1dawei1 (Fig. 15), for
example, present very similar contributions from $d_{hm}$ and
$d_{sh}$, while the importance assigned to $d_{ph}$ in 1dawei1 is
shifted to $d_{ca}$ in 1manko3. The former sequence presents a
particularly sparse crowd, making distance among elements
a strong peculiarity of groups, but when the space among
pedestrians is reduced both intra- and inter-group distances
(and consequently $d_{ph}$) become less significant. Conversely,
the causality feature $d_{ca}$ becomes more important when the
density increases, as pedestrians tend to follow each other
to avoid getting separated from the rest of the group. The
importance of heat maps emerges when comparing 1manko3 and
3shatian6 (Fig. 15(e)), as they are very helpful in decoupling
trajectories that stand very close in space but for a very limited
amount of time. In particular, in 1manko3, people crossing
from opposite sides of the road tend to be very close when
meeting in the middle, even if they are not in the same group.


7.3 Evaluating the Influence of Density Changes 

In this test setting we evaluate whether the feature weights learned
by the Structural SVM of Sec. 6 are sufficiently general to
deal with crowds at different densities and, at the same time,
whether an online version of Alg. 1 would bring
any accuracy improvement. To this end we introduce a new
video sequence, gall from GVEII, containing an average
of 70 pedestrians simultaneously present in the scene.
The distribution of pedestrians is not uniform though: their number
increases over time, as does their density, represented
by the $d_{i/o}$ ratio (Fig. 12). In order to underline the importance
of capturing changes in density, we compare the batch version
of the training algorithm Alg. 1 with a sequential and a fully
online version (Fig. 13). In the former case, examples are fed
to the supervised training procedure in temporal order one at a
time, while in the latter case the weights are initialized
to the ones learned in batch mode and at each step the algorithm learns
from its previous prediction, thus without supervision.

TABLE 6

Performance of detector [50], tracker [51] and group detection algorithms (in terms of G-MITRE) in a fully automatic pipeline.

         Detector       Tracker                            our proposal   [?]           [?]           [?]
         P      R       MOT(A/P)      MT     IDS    FRG    P      R       P      R      P      R      P      R
hotel    43.1   52.4    66.9 / 0.88   18.8   120    34     77.9   76.9    75.7   78.0   46.3   38.6   60.2   57.5
eth      68.2   53.7    92.3 / 0.08

Fig. 12. Pedestrians number and $d_{i/o}$ ratio temporal evolution in the gall
sequence of GVEII.

Fig. 13. F-1 score comparison between differently trained versions of
our method on gall of GVEII.

The plot in Fig. 13 shows that the performance of the batch-trained
version tends to decrease as the crowd density increases. While
the sequential version of the algorithm performs better, it
is slow to respond to sudden density changes, like the one in time
window 15. Indeed, a non-smooth density variation negatively
affects the training process, leading to a performance
drop that is recovered in the subsequent temporal windows.
This behavior is partially mitigated in the fully
online version. Its higher performance is explained by
an implicit regularization: using the prediction as training
input discourages the learner from drastically modifying the weight
vector, so that it mimics the smooth variation in crowd density by
slightly adjusting over time.


7.4 Performances on Real Detector and Tracker 

Our algorithm assumes the availability of correct trajectories
to detect groups, but what happens in a fully automatic video
surveillance pipeline where a people detector and a tracker are
employed? We carried out experiments by extracting pedestrian
positions through a state-of-the-art detector [50] and obtaining
trajectories by means of a continuous energy minimization
method [51]. We compare with Ge et al. [?], Yamaguchi et
al. [24] and Shao et al. [?] over the same input data; results
are shown in Tab. 6. Our proposal outperforms the competitors
even in the case of noisy trajectories.


Tracking performance shows a high number of track
fragments (FRG), mainly due to the localization
error introduced by the automatic people detector on non-trivial
crowded scenes. FRGs are proportional to the number
of small new tracks created by the system instead of correctly
associating previously tracked objects, with the consequence
of splitting ideal tracks into temporally disjoint segments.

A high FRG number affects the group detection performance,
as the $d_{ph}$ and $d_{ca}$ features are computed when the trajectories
are simultaneously present in the scene, and thus merging
temporally disjoint fragments is strongly discouraged by the
correlation clustering algorithm. Intuitively, by reducing the
size of the window we are able to minimize the number of
split trajectories in each example and recover most of the
original performance, as shown in Fig. 14(c). The improvement
is essentially achieved through the joint adoption of socially
founded features and structural learning, which weights the
features according to the observed noisy trajectories. The
experiment allows us to conclude that, even in the case of
a real application and imprecise input data, the strengths of the
proposed algorithm are maintained because they are strongly related
to the social rules that govern the group formation process;
these rules are not data dependent and hold regardless of the applied
feature extraction techniques.
Fig. 14. Group detection results on student003 when corrected tracks are used (a) and when the input comes from automatic people detector and tracker
responses (b). Regardless of the input noise, most of the groups can still be identified. This is due to the robustness of the features
employed during learning and to the decrease in length of the time window (c), which prevents fragmented tracks from being split into different groups.

(e) 3shatian6 (f) seql (g) eth (h) hotel

Fig. 15. Examples of groups detected through our method: sequences from (a) to (e) are from MPT-20x100, while (f) is part of GVEII and
(g) and (h) belong to the BIWI dataset. Groups are identified regardless of the scene context and errors are visually acceptable, as in (d).


8 Conclusion 

In this work, we pointed out the need to approach the task of
detecting social groups in crowds from a learning perspective.
Many existing methods rely on specifically tuned parameters
that limit their applicability in real world scenarios. Our
intuition is that there are crowds that preserve the same concept
of social group, but in many cases this concept cannot be
distilled from spatial considerations only. We thus defined a
set of socially inspired and strongly motivated features able to
capture and characterize the peculiarities of different groups. To learn
a socially meaningful clustering rule to group pedestrians,
we relied on the Structural SVM framework and designed a
dedicated loss function able to account for singletons as well
as for group errors. Even though the algorithm was originally
designed to work with exact trajectories, we replicated the
experiments on noisy tracklets extracted by a detector/tracker,
obtaining state-of-the-art results. Moreover, we proposed an
online training version of the method, able to achieve superior
generalization performance on crowds with variable density.

We did note, however, that as we consider wider portions
of the scene, the chance that groups of many different densities
coexist in different locations increases, leading to the necessity
to learn more than one clustering rule per scene. To resolve
this problem we plan, as future work, to learn a set of different
distance measures and use latent variables to choose the most
appropriate one given a particular zone. Code and datasets are made
publicly available in order to reproduce this paper's results and
allow the community to improve the proposed method.